Consistent deduplicated snapshot generation for a distributed database using optimistic deduplication

ABSTRACT

Embodiments disclosed herein provide systems, methods, and computer readable media for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication. In a particular embodiment, a method provides, for each node of a plurality of nodes in the distributed database, deduplicating data items stored on the node that are owned by the node and generating a summary that describes a file in which the data items are located. The method further provides identifying from the summaries for each of the nodes whether mistakes occurred during deduplication and, upon identifying one or more mistakes, determining one or more corrections for the one or more mistakes. Also, the method provides generating a consistent deduplicated snapshot for the distributed database comprising the deduplicated data items from each node and the one or more corrections.

RELATED APPLICATIONS

This application is related to and claims priority to U.S. Provisional Patent Application 62/216,096, titled “CONSISTENT DEDUPLICATED SNAPSHOT GENERATION FOR A DISTRIBUTED DATABASE USING OPTIMISTIC DEDUPLICATION,” filed Sep. 9, 2015, which is hereby incorporated by reference in its entirety.

TECHNICAL BACKGROUND

Generating snapshots of a distributed database may be difficult due, in part, to the database not being strongly consistent across the various nodes of the distributed database. That is, at any one time, data changes on one or more of the nodes may not be fully synchronized with other nodes and are therefore inconsistent with those other nodes. Additionally, snapshots are difficult since it is impossible to capture the states of all nodes at exactly the same time without freezing data changes on the nodes while the snapshot is generated. It is not practicable to freeze large databases for the amount of time needed to generate a snapshot. Moreover, in a distributed database, each data item usually has multiple copies. To improve space utilization, the snapshot should eliminate this redundancy and contain only one copy of each data item. Therefore, to generate a consistent deduplicated snapshot, each node is typically scanned multiple times to ensure consistency, which involves a relatively large amount of time and processing power.

OVERVIEW

Embodiments disclosed herein provide systems, methods, and computer readable media for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication. In a particular embodiment, a method provides, for each node of a plurality of nodes in the distributed database, deduplicating data items stored on the node that are owned by the node and generating a summary that describes a file in which the data items are located. The method further provides identifying from the summaries for each of the nodes whether one or more mistakes occurred during deduplication and, upon identifying one or more mistakes, determining one or more corrections for the one or more mistakes. Also, the method provides generating a consistent deduplicated snapshot for the distributed database comprising the deduplicated data items from each node and the one or more corrections.

In some embodiments, identifying the one or more mistakes comprises determining a quorum indicating a minimum amount of the plurality of nodes on which a particular data item is stored and using the summaries to determine whether data items of the plurality of data items meet the quorum.

In some embodiments, identifying the one or more mistakes further comprises, for particular data items that do not meet the quorum, identifying the particular data items for inclusion in the one or more mistakes.

In some embodiments, determining the one or more corrections comprises, for the particular data items, determining that the particular data items should be excluded from the deduplicated data items and creating a correction to exclude the particular data item from the deduplicated data items.

In some embodiments, identifying the one or more mistakes further comprises, for particular data items that do meet the quorum and are not included in the deduplicated data items from each node, identifying the particular data items for inclusion in the one or more mistakes.

In some embodiments, determining the one or more corrections comprises, for the particular data items, determining that the particular data items should be included in the deduplicated data items and creating a correction to include the particular data item in the deduplicated data items.

In some embodiments, generating the consistent deduplicated snapshot comprises applying the one or more corrections to the deduplicated data items before storing the consistent deduplicated snapshot.

In some embodiments, generating the consistent deduplicated snapshot comprises storing the one or more corrections in association with the deduplicated data items, wherein the one or more corrections are made to the deduplicated data items upon restoration to the deduplicated snapshot.

In some embodiments, the method further includes storing the consistent deduplicated snapshot to a version storage repository.

In another embodiment, a system including one or more computer readable storage media and a processing system operatively coupled with the one or more computer readable storage media is provided. Program instructions stored on the one or more computer readable storage media, when read and executed by the processing system, direct the processing system to at least, for each node of a plurality of nodes in the distributed database, deduplicate data items stored on the node that are owned by the node and generate a summary that describes a file in which the data items are located. The program instructions further direct the processing system to identify from the summaries for each of the nodes whether one or more mistakes occurred during deduplication and, upon identifying the one or more mistakes, determine one or more corrections for the one or more mistakes. Also, the program instructions direct the processing system to generate a consistent deduplicated snapshot for the distributed database comprising the deduplicated data items from each node and the one or more corrections.

BRIEF DESCRIPTION OF THE DRAWINGS

Many aspects of the disclosure can be better understood with reference to the following drawings. While several implementations are described in connection with these drawings, the disclosure is not limited to the implementations disclosed herein. On the contrary, the intent is to cover all alternatives, modifications, and equivalents.

FIG. 1 illustrates a computing environment for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication.

FIG. 2 illustrates an operation of the computing environment to generate a consistent deduplicated snapshot of a distributed database using optimistic deduplication.

FIG. 3 illustrates another operation of the computing environment for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication.

FIG. 4 illustrates yet another operation of the computing environment for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication.

FIG. 5 illustrates a further operation of the computing environment for recovering a distributed database using a consistent deduplicated snapshot.

FIG. 6 illustrates a snapshot system for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication.

DETAILED DESCRIPTION

As noted above, a distributed database is generally not consistent since it takes time for changes to data on any given node to propagate to other nodes of the database. Moreover, the distributed nature of the database nodes makes it impossible to capture a snapshot of each node at the exact same time without freezing the database, which is not a practical solution. While it may be possible to create a consistent snapshot by scanning each node's data multiple times, that approach is very time and processor intensive. In contrast, the examples provided herein generate a consistent deduplicated snapshot of a distributed database to a level of consistency desired by a user while only requiring a single data scanning pass of each node.

FIG. 1 illustrates computing environment 100 in an example scenario for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication. Computing environment 100 includes snapshot system 101, distributed database 102, and version storage repository 103. Distributed database 102 is made up of nodes 102-1-102-N. Snapshot system 101 and distributed database 102 communicate over communication links 111. Snapshot system 101 and version storage repository 103 communicate over communication link 112.

Distributed database 102 may be a NoSQL distributed database, such as Cassandra or Mongo databases and the like. For example, nodes 102-1-102-N may be nodes of a Cassandra database cluster. Data items stored in a typical distributed database are often replicated across the nodes that comprise the database. Therefore, when a consistent deduplicated snapshot is to be generated of database 102's data, some data that needs to be replicated to other nodes may not have propagated to all intended nodes. Snapshot system 101 therefore generates a snapshot of distributed database 102 by optimistically capturing data at each node independently of other nodes. That is, snapshot system 101 captures data for a given node regardless of whether at least some of that data should actually be included in the consistent deduplicated snapshot. Snapshot system 101 then corrects the captured data without having to rescan nodes based on information gleaned from the data captured from other nodes.

FIG. 2 illustrates operation 200 of computing environment 100 to generate a consistent deduplicated snapshot of a distributed database using optimistic deduplication. Operation 200 may be performed periodically to generate a consistent deduplicated snapshot, upon instruction of a user, upon an event occurring in distributed database 102, or for some other reason. In operation 200, for each node in distributed database 102, snapshot system 101 deduplicates data items stored on the node that are owned by the node (step 201). The data items may include all data items stored on the node or may be a subset of the data items, such as data items that have changed since a previous snapshot, as may be the case when snapshot system 101 is an incremental versioning system. Snapshot system 101 then generates a summary that describes a location of the data items (step 202). The summary may include information identifying data items that are not owned but are relevant to the consistent deduplicated snapshot (e.g. have been changed since a previous snapshot), may include information identifying the deduplicated data items, or any other information that may be relevant to the consistent deduplicated snapshot process.
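Steps 201 and 202 can be pictured with a minimal Python sketch. It is illustrative only and is not taken from the disclosure: the `NodeScan` structure, the `scan_node` function, and the `owner_of` lookup are hypothetical names, and the summary is assumed to be simply the set of keys seen on the node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeScan:
    """Result of one scanning pass over a single node (hypothetical structure)."""
    node_id: str
    captured: dict  # key -> value, only for items this node owns
    summary: set    # keys of every item seen on the node


def scan_node(node_id, items, owner_of):
    """Capture owned items and summarize everything stored on the node.

    items:    mapping of key -> value stored on the node.
    owner_of: callable mapping a key to the id of its owning node
              (an assumption; how ownership is decided is not specified here).
    """
    # Optimistic capture: keep an item only if this node is its owner.
    captured = {k: v for k, v in items.items() if owner_of(k) == node_id}
    # The summary records every key present, owned or not.
    return NodeScan(node_id, captured, set(items))
```

Because each key has a single owner in this sketch, the union of the `captured` mappings across nodes contains each item at most once, with no cross-node comparison during the scan.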

Snapshot system 101 uses the summaries from each of nodes 102-1-102-N to identify for each of the nodes whether mistakes occurred during deduplication (step 203). A mistake may be a data item that is captured by the deduplication that should not be included in a consistent deduplicated snapshot. Alternatively, the mistake may be a data item that was left out but should be included. Since the summaries include information describing the data items at each node, it can be determined relatively quickly which nodes include which data items. In one example, snapshot system 101 may use a quorum to identify mistakes from the summaries. The quorum may be provided by a user of snapshot system 101 or set by some other means. The quorum indicates a minimum number of nodes in distributed database 102 that include a particular data item in order for that data item to be included in the consistent deduplicated snapshot. Accordingly, if a data item that is found in the scan of at least one of nodes 102-1-102-N does not reach the quorum, as indicated in the summaries, then that data item is considered a mistake. For instance, if only nodes 102-1, 102-2, and 102-3 include a particular data item and the quorum is set to five, then that particular data item is a mistake to include in the consistent deduplicated snapshot.
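The quorum check of step 203 might be sketched as follows. This is a hedged illustration, not the disclosed implementation: `find_under_quorum` is a hypothetical name, and each summary is assumed to be a set of data item keys per node.

```python
from collections import Counter


def find_under_quorum(summaries, captured_keys, quorum):
    """Return captured keys that appear on fewer than `quorum` nodes.

    summaries:     mapping of node id -> set of keys found on that node.
    captured_keys: keys present in the optimistically deduplicated data.
    """
    # Count, for every key, how many node summaries mention it.
    counts = Counter(k for keys in summaries.values() for k in set(keys))
    return sorted(k for k in captured_keys if counts[k] < quorum)
```

Only the summaries are consulted, so no node needs to be scanned a second time to reach this decision.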

Upon identifying one or more mistakes, snapshot system 101 determines one or more corrections for the one or more mistakes (step 204). As noted in the example above, a mistake may be a data item that should not be included in the consistent deduplicated snapshot. A correction therefore corrects that mistake by removing the data item from the consistent deduplicated snapshot, by providing an instruction to remove the data item should the consistent deduplicated snapshot ever be used for a database restore, or by using some other means of fixing the mistake. Similarly, if a data item is left out when it should be included, the correction may correct the mistake by including the data item in the consistent deduplicated snapshot, by providing an instruction to include the data item should the consistent deduplicated snapshot ever be used for a database restore, or by using some other means of fixing the mistake. A data item may be left out if the node that is the owner of the data item does not include the data item when scanned for deduplication at step 201 but the data item is included in enough of the other nodes to meet the quorum requirement.
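Both kinds of correction in step 204 can be derived from the same per-key counts. The sketch below is an assumption-laden illustration: the function name and the `("exclude", key)` / `("include", key)` tuple encoding are hypothetical and are not specified by the disclosure.

```python
from collections import Counter


def determine_corrections(summaries, captured_keys, quorum):
    """Build exclude/include corrections from per-node summaries.

    summaries:     mapping of node id -> set of keys found on that node.
    captured_keys: keys present in the optimistically deduplicated data.
    """
    counts = Counter(k for keys in summaries.values() for k in set(keys))
    corrections = []
    for key in sorted(counts):
        meets_quorum = counts[key] >= quorum
        if key in captured_keys and not meets_quorum:
            # Captured by an owner, but too few nodes hold it.
            corrections.append(("exclude", key))
        elif key not in captured_keys and meets_quorum:
            # Missed by the owner's scan, yet enough nodes hold it.
            corrections.append(("include", key))
    return corrections
```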

Snapshot system 101 then generates a consistent deduplicated snapshot for the distributed database comprising the deduplicated data items from each node and the one or more corrections (step 205). The deduplicated data items from each node do not require further deduplication since only one node will be the owner of any one data item. Thus, the data items are already deduplicated for the entire distributed database 102. In some examples, the corrections may be made to the data items when the consistent deduplicated snapshot is stored. In other examples, however, the data items may be stored without the corrections having been made and the corrections may be stored in association with the data items as part of the consistent deduplicated snapshot. In those examples, the corrections are only applied to the data items when database 102 is to be restored using the consistent deduplicated snapshot.
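The two storage options in step 205 (applying corrections before storing, or storing them alongside the data) could look like the following sketch. The dictionary layout of the snapshot, the `apply_now` flag, and the `included_values` mapping (values fetched for items that must be added) are all hypothetical choices made for illustration.

```python
def apply_corrections(items, corrections, included_values):
    """Return a copy of `items` with exclude/include corrections applied."""
    result = dict(items)
    for action, key in corrections:
        if action == "exclude":
            result.pop(key, None)
        elif action == "include":
            result[key] = included_values[key]
    return result


def build_snapshot(captured, corrections, included_values, apply_now):
    """Assemble a snapshot, either corrected eagerly or with deferred corrections."""
    if apply_now:
        # Corrections are folded in; nothing is left to apply at restore time.
        items = apply_corrections(captured, corrections, included_values)
        return {"items": items, "corrections": [], "included_values": {}}
    # Data is stored as captured; corrections travel with the snapshot.
    return {
        "items": dict(captured),
        "corrections": list(corrections),
        "included_values": dict(included_values),
    }
```

Deferring the corrections leaves the captured data untouched at snapshot time, at the cost of extra work if the snapshot is ever restored.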

Once created, the consistent deduplicated snapshot may be stored to version storage repository 103, which is configured to store snapshots generated by snapshot system 101 as versions of distributed database 102. Should distributed database 102 require recovery to a point in time captured by one of the stored versions, that version need merely be retrieved from version storage repository 103 to repopulate distributed database 102 with the data stored therein.

Advantageously, whenever a consistent deduplicated snapshot is to be generated of distributed database 102, each of nodes 102-1-102-N need only be scanned once for data items. Using the quorum requirement, snapshot system 101 can determine whether data items not propagated to all nodes should still be included in the consistent deduplicated snapshot rather than rescanning to determine whether the data items did in fact propagate.

FIG. 3 illustrates an operation 300 of computing environment 100 for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication. Operation 300 shows three nodes 301-303 which are part of a distributed database similar to database 102 in FIG. 1. Node 301 includes data items A-D, node 302 includes data items A-C and E, and node 303 also includes data items A-C and E. For the purposes of this example, node 301 owns data items A and D, node 302 owns data items B and C, and node 303 owns data item E. The quorum in this example is set to two, so any one data item must occur two or more times in the database for that item to be included in the consistent deduplicated snapshot.

At step 1, when a consistent deduplicated snapshot is to be generated for the database, each of nodes 301-303 is scanned to deduplicate its respective owned data items. Data items A and D are deduplicated for node 301, data items B and C are deduplicated for node 302, and data item E is deduplicated for node 303. These deduplicated data items are stored as optimistic snapshot 304 having data items A-E. Further at step 1, summaries are generated for each node 301-303 describing a file from which the data items were scanned. In this case, each node only includes one file for the data items. Other examples, however, may include multiple files having data items, in which case a separate summary would be generated for each of these multiple files along with the deduplication process performed on each of those files. The summary for node 301 in this example describes that node 301 includes items A-D, the summary for node 302 describes that node 302 includes items A-C and E, and the summary for node 303 describes that node 303 likewise includes items A-C and E.

At step 2, mistakes are identified from the summaries generated at step 1. Specifically, the summaries indicate that data item D is included at node 301 but not nodes 302 and 303. Thus, data item D only occurs once in the database, which is lower than the quorum requirement of two. Optimistic snapshot 304 is therefore corrected by removing data item D from optimistic snapshot 304 to form corrected snapshot 305, which is a consistent deduplicated snapshot. As noted above, corrected snapshot 305 may be stored as the result of the snapshot creation process or optimistic snapshot 304 may be stored along with the corrections generated at step 2 for use to correct optimistic snapshot 304 when restoring data from optimistic snapshot 304.
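The FIG. 3 example can be traced in a few lines of Python. The node contents, ownership, and quorum come from the description above; the variable names and the use of sets and a counter are illustrative assumptions only.

```python
from collections import Counter

# Data items held by each node, as described for nodes 301-303.
node_items = {
    301: {"A", "B", "C", "D"},
    302: {"A", "B", "C", "E"},
    303: {"A", "B", "C", "E"},
}
# Owning node of each data item.
owner = {"A": 301, "D": 301, "B": 302, "C": 302, "E": 303}
QUORUM = 2

# Step 1: each node contributes only the items it owns (optimistic snapshot 304).
optimistic = sorted(
    key for node, keys in node_items.items() for key in keys if owner[key] == node
)

# Step 2: count occurrences from the summaries and drop items below the quorum.
counts = Counter(key for keys in node_items.values() for key in keys)
mistakes = [key for key in optimistic if counts[key] < QUORUM]
corrected = [key for key in optimistic if key not in mistakes]

print(optimistic)  # ['A', 'B', 'C', 'D', 'E']
print(mistakes)    # ['D']
print(corrected)   # ['A', 'B', 'C', 'E']
```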

FIG. 4 illustrates operation 400 of computing environment 100 for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication. In operation 400, data items from nodes 102-1-102-N that are to be stored as a snapshot are included in Sorted Strings Tables (SSTables) 402. Information describing the data items stored on each of nodes 102-1-102-N may further be included in SSTables 402 or may be included in separate summary files. At step 1, snapshot system 101 processes SSTables 402 using deduplication/quorum processing logic 401. Logic 401 operates on SSTables 402 to deduplicate data items stored therein and determine whether the data items in SSTables 402 meet a quorum requirement. The quorum requirement may be preset in logic 401, may be received from a user/administrator of snapshot system 101, may be adaptive depending on the number of nodes in distributed database 102, or may be determined in some other manner.

At step 2a, the resultant deduplicated SSTables 432 are included in snapshot 403. Likewise, at step 2b, corrections 431 are also stored as part of snapshot 403 (e.g. as a separate correction file with snapshot 403 acting as a container of both corrections 431 and deduplicated SSTables 432). Corrections 431 indicate data items in deduplicated SSTables 432 that should not be included when restoring distributed database 102 using snapshot 403. Also, corrections 431 indicate data items that should be included in deduplicated SSTables 432 when restoring distributed database 102 using snapshot 403. A data item may not have been included if the data item was not owned by any of nodes 102-1-102-N but still existed on enough nodes to meet the quorum requirement. In those cases, corrections 431 may not only include indications that one or more data items should have been included in deduplicated SSTables 432 but also may include the data items themselves. Once identified, those data items may need to be requested from at least one of their storing nodes 102-1-102-N in order for snapshot system 101 to include them in corrections 431.

Operation 400, as described above, therefore differs from operation 300 in that operation 300 would have applied corrections 431 to deduplicated SSTables 432 before storing snapshot 403 to version storage repository 103, which eliminates the need to store corrections 431 in snapshot 403. In contrast, operation 400 allows deduplicated SSTables 432 to remain “as is” and simply stores corrections 431 in snapshot 403 for use in the event snapshot 403 is ever needed for recovery.

FIG. 5 illustrates operation 500 of computing environment 100 for recovering a distributed database using a consistent deduplicated snapshot. Specifically, operation 500 describes how snapshot 403 created above is used to recover distributed database 102. In operation 500, snapshot system 101 is also used for recovering distributed database 102 from snapshots stored in version storage repository 103. However, an alternative system may be employed for the recovery process.

Once snapshot system 101 receives an instruction to recover distributed database 102 using snapshot 403, snapshot system 101 retrieves snapshot 403 from version storage repository 103. At step 1, snapshot system 101 recovers nodes 102-1-102-N using the data items in deduplicated SSTables 432. Due to the deduplicated nature of the data items, a single data item in deduplicated SSTables 432 may need to be replicated across multiple nodes depending on which node had stored the data item when snapshot 403 was created. After the data items have been recovered to nodes 102-1-102-N, snapshot system 101 applies corrections 431 at step 2. The application of corrections 431 may include deleting data items from nodes 102-1-102-N that did not meet the quorum requirement and/or adding data items that did meet the quorum requirement but were not included in deduplicated SSTables 432. In some examples, the application of corrections 431 may be performed in conjunction with the recovery of data items. For instance, in those examples, a correction that indicates a particular data item should not be included will simply prevent that data item from being recovered to any of nodes 102-1-102-N in the first place rather than deleting it later on.
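The variant in which corrections are applied in conjunction with recovery might be sketched as below. Everything here is a hypothetical illustration: the `restore` function, the `holders` mapping (which nodes stored each item when the snapshot was taken), and the correction tuple format are assumptions, not structures defined by the disclosure.

```python
def restore(snapshot_items, holders, corrections, included_values):
    """Rebuild per-node contents from a deduplicated snapshot plus corrections.

    snapshot_items:  key -> value, one copy per data item.
    holders:         key -> list of node ids that stored the item at snapshot time.
    corrections:     list of ("exclude" | "include", key) tuples.
    included_values: key -> value for items that corrections say to add.
    """
    excluded = {key for action, key in corrections if action == "exclude"}
    nodes = {}
    for key, value in snapshot_items.items():
        if key in excluded:
            # Never recovered in the first place, so nothing to delete later.
            continue
        for node in holders.get(key, []):
            nodes.setdefault(node, {})[key] = value
    for action, key in corrections:
        if action == "include":
            # Items that met the quorum but were absent from the snapshot data.
            for node in holders.get(key, []):
                nodes.setdefault(node, {})[key] = included_values[key]
    return nodes
```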

If operation 500 had instead described the recovery of distributed database 102 using a snapshot generated in a manner described by operation 300, there would not be any corrections 431 to apply. That is, corrections 431 will have already been applied to deduplicated SSTables 432 before storing the snapshot in version storage repository 103. Thus, recovering such a snapshot would merely require recovering the data items in already corrected deduplicated SSTables 432.

Referring back to FIG. 1, snapshot system 101 comprises a computer system and communication interface. Snapshot system 101 may also include other components such as a router, server, data storage system, and power supply. Snapshot system 101 may reside in a single device or may be distributed across multiple devices. Snapshot system 101 could be an application server(s), a personal workstation, or some other network capable computing system, including combinations thereof. While shown separately, all or portions of snapshot system 101 could be integrated with the components of at least one of nodes 102-1-102-N.

Nodes 102-1-102-N of distributed database 102 each comprise one or more data storage systems having one or more non-transitory storage media, such as a disk drive, flash drive, magnetic tape, data storage circuitry, or some other memory apparatus. The data storage systems may also include other components such as processing circuitry, a network communication interface, a router, server, data storage system, user interface and power supply. The data storage systems may reside in a single device or may be distributed across multiple devices.

Version storage repository 103 likewise comprises a data storage system having one or more non-transitory storage media, such as a disk drive, flash drive, magnetic tape, data storage circuitry, or some other memory apparatus. Version storage repository 103 may also include other components such as processing circuitry, a network communication interface, a router, server, data storage system, user interface and power supply. Version storage repository 103 may reside in a single device or may be distributed across multiple devices. Also, while shown separately, version storage repository 103 may be incorporated into snapshot system 101.

Communication links 111-112 could be internal system busses or use various communication protocols, such as Time Division Multiplex (TDM), Internet Protocol (IP), Ethernet, communication signaling, Code Division Multiple Access (CDMA), Evolution Data Only (EVDO), Worldwide Interoperability for Microwave Access (WIMAX), Global System for Mobile Communication (GSM), Long Term Evolution (LTE), Wireless Fidelity (WIFI), High Speed Packet Access (HSPA), or some other communication format, including combinations thereof. Communication links 111-112 could be direct links or may include intermediate networks, systems, or devices.

FIG. 6 illustrates snapshot system 600. Snapshot system 600 is an example of snapshot system 101, although system 101 may use alternative configurations. Snapshot system 600 comprises communication interface 601, user interface 602, and processing system 603. Processing system 603 is linked to communication interface 601 and user interface 602. Processing system 603 includes processing circuitry 605 and memory device 606 that stores operating software 607.

Communication interface 601 comprises components that communicate over communication links, such as network cards, ports, RF transceivers, processing circuitry and software, or some other communication devices. Communication interface 601 may be configured to communicate over metallic, wireless, or optical links. Communication interface 601 may be configured to use TDM, IP, Ethernet, optical networking, wireless protocols, communication signaling, or some other communication format, including combinations thereof.

User interface 602 comprises components that interact with a user. User interface 602 may include a keyboard, display screen, mouse, touch pad, or some other user input/output apparatus. User interface 602 may be omitted in some examples.

Processing circuitry 605 comprises a microprocessor and other circuitry that retrieves and executes operating software 607 from memory device 606. Memory device 606 comprises a non-transitory storage medium, such as a disk drive, flash drive, data storage circuitry, or some other memory apparatus. Operating software 607 comprises computer programs, firmware, or some other form of machine-readable processing instructions. Operating software 607 includes deduplication and correction module 608 and snapshot generation module 609. Operating software 607 may further include an operating system, utilities, drivers, network interfaces, applications, or some other type of software. When executed by circuitry 605, operating software 607 directs processing system 603 to operate snapshot system 600 as described herein.

In particular, deduplication and correction module 608 directs processing system 603 to, for each node of a plurality of nodes in the distributed database, deduplicate data items stored on the node that are owned by the node, generate a summary that describes a file in which the data items are located, and identify from the summaries for each of the nodes whether one or more mistakes occurred during deduplication. Upon identifying the one or more mistakes, deduplication and correction module 608 directs processing system 603 to determine one or more corrections for the one or more mistakes. Snapshot generation module 609 directs processing system 603 to generate a consistent deduplicated snapshot for the distributed database comprising the deduplicated data items from each node and the one or more corrections.

The above description and associated figures teach the best mode of the invention. The following claims specify the scope of the invention. Note that some aspects of the best mode may not fall within the scope of the invention as specified by the claims. Those skilled in the art will appreciate that the features described above can be combined in various ways to form multiple variations of the invention. As a result, the invention is not limited to the specific embodiments described above, but only by the following claims and their equivalents.

What is claimed is:
1. A method of generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication, the method comprising: for each node of a plurality of nodes in the distributed database, deduplicating data items that are identified as being stored on the node and owned by the node, and generating a summary for the node, the summary describing a file in which the data items that are identified as being stored on the node are located; identifying from the summaries for each of the nodes whether one or more mistakes occurred during deduplication; upon identifying the one or more mistakes, determining one or more corrections for the one or more mistakes; and generating a consistent deduplicated snapshot for the distributed database comprising the deduplicated data items from each node and the one or more corrections.
2. The method of claim 1, wherein identifying the one or more mistakes comprises: determining a quorum indicating a minimum amount of the plurality of nodes on which a particular data item is stored; and using the summaries to determine whether data items of the plurality of data items meet the quorum.
3. The method of claim 2, wherein identifying the one or more mistakes further comprises: for particular data items that do not meet the quorum, identifying the particular data items for inclusion in the one or more mistakes.
4. The method of claim 3, wherein determining the one or more corrections comprises: for the particular data items, determining that the particular data items should be excluded from the deduplicated data items and creating a correction to exclude the particular data item from the deduplicated data items.
5. The method of claim 2, wherein identifying the one or more mistakes further comprises: for particular data items that do meet the quorum and are not included in the deduplicated data items from each node, identifying the particular data items for inclusion in the one or more mistakes.
6. The method of claim 5, wherein determining the one or more corrections comprises: for the particular data items, determining that the particular data items should be included in the deduplicated data items and creating a correction to include the particular data item in the deduplicated data items.
7. The method of claim 1, wherein generating the consistent deduplicated snapshot comprises: applying the one or more corrections to the deduplicated data items before storing the consistent deduplicated snapshot.
8. The method of claim 1, wherein generating the consistent deduplicated snapshot comprises: storing the one or more corrections in association with the deduplicated data items, wherein the one or more corrections are made to the deduplicated data items upon restoration to the deduplicated snapshot.
9. The method of claim 1, further comprising: storing the consistent deduplicated snapshot to a version storage repository.
10. A system for generating a consistent deduplicated snapshot of a distributed database using optimistic deduplication, the system comprising: one or more computer readable storage media; a processing system operatively coupled with the one or more computer readable storage media; and program instructions stored on the one or more computer readable storage media that, when read and executed by the processing system, direct the processing system to perform operations comprising: for each node of a plurality of nodes in the distributed database, deduplicating data items that are identified as being stored on the node and owned by the node and generating a summary for the node, the summary describing a file in which the data items that are identified as being stored on the node are located; identifying from the summaries for each of the nodes whether one or more mistakes occurred during deduplication; upon identifying the one or more mistakes, determining one or more corrections for the one or more mistakes; and generating a consistent deduplicated snapshot for the distributed database comprising the deduplicated data items from each node and the one or more corrections.
11. The system of claim 10, wherein the identifying the one or more mistakes includes the program instructions directing the processing system to perform operations comprising: determining a quorum indicating a minimum amount of the plurality of nodes on which a particular data item is stored; and determining whether data items of the plurality of data items meet the quorum based on the summaries.
12. The system of claim 11, wherein the identifying the one or more mistakes includes the program instructions further directing the processing system to perform operations comprising: for particular data items that do not meet the quorum, identifying the particular data items for inclusion in the one or more mistakes.
13. The system of claim 12, wherein the determining the one or more corrections includes the program instructions directing the processing system to perform operations comprising: for the particular data items, determining that the particular data items should be excluded from the deduplicated data items and creating a correction to exclude the particular data item from the deduplicated data items.
14. The system of claim 11, wherein the identifying the one or more mistakes includes the program instructions directing the processing system to perform operations comprising: for particular data items that do meet the quorum and are not included in the deduplicated data items from each node, identifying the particular data items for inclusion in the one or more mistakes.
15. The system of claim 14, wherein the determining the one or more corrections includes the program instructions directing the processing system to perform operations comprising: for the particular data items, determining that the particular data items should be included in the deduplicated data items and creating a correction to include each of the particular data items in the deduplicated data items.
16. The system of claim 10, wherein the generating the consistent deduplicated snapshot includes the program instructions directing the processing system to perform operations comprising: applying the one or more corrections to the deduplicated data items before storing the consistent deduplicated snapshot.
17. The system of claim 10, wherein the generating the consistent deduplicated snapshot includes the program instructions directing the processing system to perform operations comprising: storing the one or more corrections in association with the deduplicated data items, wherein the one or more corrections are made to the deduplicated data items upon restoration to the deduplicated snapshot.
18. The system of claim 10, wherein the program instructions further direct the processing system to perform operations comprising: storing the consistent deduplicated snapshot to a version storage repository.